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Abstract 

The PARADIGM compiler project provides an automated 
means to parallelize programs, written in a serial program- 
ming model, for efficient execution on distributed-memory 
muiticomputers. A previous implementation of the com- 
piler based on the PTD representation allowed symbolic ar- 
ray sizes, affine loop bounds and array subscripts, and vari- 
able number of processors, provided that arrays were single- 
or multi-dimensionallv block distributed. The techniques 
presented here extend the compiler to also accept multi- 
dimensional cyclic and block-cyclic distributions within 
a uniform symbolic framework. These extensions demand 
more sophisticated symbolic manipulation capabilities. A 
novel aspect of our approach is to meet this demand by in- 
terfacing PARADIGM with a powerful off-the-shelf symbolic 
package, Mathematics . ™ . This paper describes some of the 
Matherriatica ™ routines that performs various transforma- 
tions, shows how they are invoked and used by the compiler 
to overcome the new challenges, and presents experimental 
results for code involving cyclic and block-cyclic arrays 
as evidence of the feasibility of the approach. 

1 Introduction 

Distributed-memory multicomputers offer significant advan- 
tages over shared-memory multiprocessors in terms of cost 
and scalability. Unfortunately, extracting all the compu- 
tational power from these machines requires users to write 
efficient software for them, which is a laborious process. One 
major reason for this difficulty is the absence of a global ad- 
dress space. As a result, the programmer has to distribute ' 
code and data across processors and manage communication 
among tasks explicitly. 
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The PARADIGM project at the University of Illinois ad- 
dresses this problem by developing an automated means to 
parallelize and optimize sequential programs for efficient ex- 
ecution on multicomputers. It uses Parafrase-2 (21) as a pre- 
processing platform to parse the sequential program into an 
intermediate representation, to analyze the code and gener- 
ate flow, dependence, and call graphs, and to perform trans- 
formations such as constant propagation and induction vari- 
able substitution. Some of the other major research efforts 
in this area include . Fortran D [16], Fortran 90D- [TO], the 
SUIF compiler [2], and the Superb compiler [11]. 

In addition to the traditional compiler optimizations to 
distribute computations and to reduce communication over- 
heads. PARADIGM is unique in its ability to: (1) option- 
ally parse High Performance Fortran (HPF) directives [17]; 
(2) perform automatic data distribution for regular com- 
putations, conventionally only specified through user direc- 
tives; (3) generate high-level communication primitives; (4) 
optimize communication for regular computations; (5) sup- 
port irregular computations using a combination of compile- 
time analysis and run-time support; (6) exploit functional 
and data parallelism simultaneously; and (7) generate mul- 
tithreaded message-driven code to tolerate communication 
latencies. Current efforts in the project aim at integrating 
all of these features into the same framework [5], Figure 1 
shows a functionai illustration of how we envision the com- 
plete PARADIGM compilation system. The compiler ac- 
cepts either a sequential FORTRAN 77 or HPF program 
and produces a parallel program optimized for a target ma- 
chine. 

Previously, PARADIGM was implemented using Proces- 
sor Tagged Descriptors [22] (PTD) as the underlying data 
structure to provide a uniform representation to describe 
both distributed arrays and partitioned loops. This imple- 
mentation allowed symbolic array sizes, multi-dimensional 
distributions, variable number of processors, and affine loop 
bounds and array subscripts, but the distribution type was 
limited to block. Later extensions to PTD proved effective 
for purely cyclic distributions. 

This paper presents techniques to support block-cyclic 
distributions, in addition to block and cyclic, all within a 
unified framework. The techniques are based on the Fourier- 
Motzkin elimination method [6] extended with many sym- 
bolic capabilities required by the compiler. A novel aspect 
of our approach is to meet the increasing demands for more 
sophisticated symbolic manipulations by exploiting the well- 
established functionality of existing powerful symbolic soft- 
ware, in our case, Mathematica. Just as most parsers today 
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Figure 2: Linking PARADIGM with Mathematica 


internal representation based on Parafrase-2. is interfaced 
to Mathematica via MathLink. These tools are the result 
of only three man-months and represent roughly 1000 lines 
of Mathematica code and 700 lines of C code. The key to 
this quick turnaround is the extensive use of built-in Math- 
ematica functionality as well as public domain Mathematica 
packages. 


Figure 1: PARADIGM Compiler Overview 

are implemented using stable and wide-spread toois like yacc 
and lex instead of being built from, scratch, we argue that 
many parallelizing compilers in the future, in particular re- 
search prototypes developed in academic environments, will 
rely on off-the-shelf symbolic packages for most of their sym- 
bolic manipulations. Several reasons account for this: 

• New compilers axe required to perform increasingly 
complex symbolic tasks [9, 15, 22). 

• As compiler algorithms become more complex, there 
is a need to focus more on their designs rather than on 
their implementation details. 

• Many powerful symbolic packages are readily available 
(e.g., Mathematica. Maple), and some provide exter- 
nal interfaces (e.g., MathLink) with reasonable perfor- 
mance and robustness. 

Clearly, the performance is not comparable with a C im- 
plementation optimized for special cases, but compilation 
time has not been a major problem in our experience. Fur- 
thermore, it is not a good practice to fully tune compiler 
features until their impact on optimizing real codes has been 
testes [8]. The approach presented in this paper allows quick 
prototyping and testing of new algorithms with real pro- 
grams. Once the effectiveness of an algorithm is ..verified, 
then the critical features affecting efficiency and compila- 
tion time can be identified and rewritten in C. 

The rest of this paper is organized as follows. Section 2 
describes some of the Mathematica packages that we have 
written to meet the symbolic demands of the compiler. Sec- 
tion 3 shows how PARADIGM performs computation parti- 
tioning and communication generation by building inequal- 
ity expressions from the source program and passing them to 
the Fourier-Motzkin package in Mathematica to obtain nec- 
essary symbolic expressions. Some results for two programs 
involving cyclic and block-cyclic array distributions, re- 
spectively, appear in Section 4. Finally, concluding remarks 
are made in Section 5. 

2 Mathematica Tools in PARADIGM 

This section describes a set of tools that we have built using 
Mathematica and shows how P.ARADIGM, which uses an 


2.1 Linking PARADIGM with Mathematica 

An overview of the interactions between PARADIGM and 
Mathematica is shown in Figure 2. This follows a typical 
client/server model in which the server ( Mathematica ) waits 
for a “symbolic request,” processes it, and returns the re- 
sult to the client. Both run as separate UNIX processes and 
communicate with each other using MathLink. The under- 
lying communication mechanism can be either UNIX pipes, 
when they are running on the same machine, or TCP/IP 
sockets for remote execution. 

Efficient forward and reverse conversion routines transfer 
data between PARADIGM’S internal representation (based 
on Parafrase-2) and Mathematica expressions ( P2ToMath 
and MathToP2 in the figure). Conversion time is kept to 
a minimum using several techniques: 

• The structure of the expression is never lost. We use 
MathLink API function calls to send and receive a to- 
kenized stream. 

• Variable names Eire transformed so that they encode a 
pointer to the symbol table. This eliminates symbol 
table look-ups when expressions return from Mathe- 
matica. 

• Mathematica also pre-processes the result ( PreMath - 
ToP2 in the figure) so that functions have the correct 
number of arguments and easy-to-decode names fol- 
lowing a preset convention. 


An Example: Simplifying Loop Bounds To illustrate 
the interactions between PARADIGM and Mathematica we 
use a simple- example, namely the simplification of loop 
bounds. After certain loop transformations, such as loop 
normalization, bound expressions can become very complex, 
and in particular, when arrays with multiple levels of indi- 
rection are involved, it is difficult for conventional compilers 
to “clean up the mess.” 

Figure 3 shows out solution to this problem using Math- 
ematica. The PARADIGM process first opens a connection 
with the Mathematica process and then traverses the hierar- 
chical representation of the loop nest and finds the lower and 
upper bound expressions. Each of these expressions is trans- 
formed into a Mathematica expression using P2ToMath, and 
a Mathematica built-in function, Simplify , is applied to them 
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Figure 3: Simplifying Loop Bounds 


by che server. The resuit is sent back to the client where it 
is transformed into a PARADIGM expression by MathToPS 
and inserted into the appropriate data structures. 

A simple optimization that reduces the number of in- 
teractions between the processes is to generate a list with 
all the lower and upper bound expressions and send it for 
simplification as one stream. This is particularly important 
when che granularity of the Mathematica operations is small 
as it is in this case. 

2.2 Symbolic Fourier-Motzkin Elimination (FME) 

The Fourier-Motzkin method solves systems of linear in- 
equalities with real variables using an elimination method. 
In the context of compilers it has important applications, in- 
cluding data dependence testing [19], inter-procedural anal- 
ysis and. of particular importance to us, loop transforma- 
tions [7] and automatic generation of communication for 
muiticomputers [2, 3]. 

Our implementation quite closely follows che algorithm 
described by U. Banerjee [6] but also includes some of the 
conditions stated by M. Ancourt [4] to ensure that the in- 
teger projection is the same as the real projection. A fully 
symbolic implementation takes hardly 100 lines of Math- 
ematica code, making extensive use of built-in functional-' 
itv. Symbolic comparisons at every stage of the elimina- 
tion process significantly reduce the number of redundant 
constraints and improve performance. This is particularly 
efficient when the compiler can provide information about 
the signs of symbolic constants; which is usually the case 1 . 
Other redundant constraints are eliminated using traditional 
methods [4]. 

To improve the quality of the output of our implemen- 
tation of FME, we have extended the built-in\ simplification 
rules of Mathematica, providing extensive support for floor, 
ceiling, min and max operations. In particular, declaring 
unknown symbolic variables as integer and using the fact 
that integers are closed under certain operations (+, — , x), 
we can eliminate many unnecessary floor and ceiling opera- 
tions. In addition, using a public domain package 2 3 , expres- 

1 Symbolic coefficients with unknown signs give rise to multi- 

version code. 

3 Thanks to Stephan Kaufmann at the Swiss Federal Institute of 
Technology for providing the NonNegativeQ package. 


sions like 3a . d~d 64 - 2 can be determined as non-negative if a 
and b are first declared to be non- negative. This is used to 
eliminate operands in min and max operations by just sub- 
tracting operands and checking non-negativity of the result 
Since our compiler uses FME extensively to find various 
iteration sets (as will be described in Section 3), we shall 
briefly explain its interface. Its input consists of a set of 
inequalities I and a list of variables V. The order of elim- 
ination of the variables in V is from the innermost to the 
outermost loop. Therefore, after elimination, a variable will 
only depend on enclosing loop variables. Thus, the solu- 
tion set 72. that is returned consists of loop bounds for each 
variable in V. These loop bounds cam be used to scan the 
polvtope defined by the original set of inequalities 1. 

2.3 Loop Transformations 

The functionality of a loop transformation pass is split be- 
tween the PARADIGM process and the Mathematica pro- 
cess in the following way: the PARADIGM process identi- 
fies the loops involved in the transformation, checks that the 
transformation requested is valid using the data dependence 
and control flow graphs, establishes a connection with the 
Mathematica process, and sends it the relevant portions of 
the loop such as the loop bounds, array subscripts, and/or 
the entire loop body. The Mathematica process computes 
the new loop bounds after the transformation and a set of 
replacement rules to map the old loop variables into the new 
ones. It then returns the new loop bounds and loop body 
after applying the replacement rules. The PARADIGM pro- 
cess converts the results back into its internal representation. 

Loop Normalization Loop normalization is typically per- 
formed as a pre-pass of the compiler analysis and reduces the 
lower bounds and strides to one. PARADIGM requests this 
transformation whenever the loop stride is not one. The im- 
plementation is trivial in Mathematica using built-in replace- 
ment and simplification rules. Note that affine subscripts 
and loop bounds remain aifine after loop normalization. 

Unimoduiar Loop Transformations Unimoduiar loop 
transformations provide a uniform view to many important 
transformations [6] (e.g. loop interchange, loop skewing, 
and loop reversal) and are a powerful tool to increase par- 
allelism [7] and/or locality [23]. A unimoduiar loop trans- 
formation is described by a unimoduiar matrix, which is an 
integer matrix with a determinant of ±1. In our Mathe- 
matica implementation, unimoduiar matrices are allowed to 
have symbolic terms. In particular, the unimoduiarity of a 
symbolic matrix is verified before its inverse is computed us- 
ing built-in functions. Then, the corresponding loop trans- 
formations are applied to a set of loops by computing the 
new loop variables with the inverse unimoduiar matrix, gen- 
erating a set of inequalities from the loop bounds and the 
new loop variables, and finally applying FME to obtain the 
new, loop bounds. Roughly 20 lines of Mathematica code in 
addition to the Fourier-Motzkin package were required for 
performing unimoduiar loop transformations. 

Redundant Access Elimination Transformations The 

method to be described in Section 3.2 to generate commu- 
nication does not ensure that each element of an array is 
only transmitted once. Based on the early work of Gallivan 
et ai [13], later refined by Ancourt [4], we can obtain a new 
set of loops that access each element at most once. The 
main idea behind this transformation is the decomposition 
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of the inceger matrix that describes the access pattern into 
the product of other matrices with special characteristics. 
In our implementation, we used a Mathematica package 3 to 
obtain the Smith Normal Form and Hermite Column or Row 
Form of a matrix. Then, we can use the unimodular package 
described above to compute the new loop bounds. Currently 
PARADIGM is not taking advantage of this transformation, 
but it will in the near future. 

2.4 Other Tools of Interest 

In this section we briefly describe some other Mathematica 
tools that we have written that are useful in other areas of 
the compilation process. 

Integer Area Estimation For applications like automatic 
data partitioning [14] or static buffer allocation it is crucial 
to have compile-time estimates of computation and/or com- 
munication requirements. Good estimates require the com- 
putation of the integer volume of a polyhedron, but this is in 
general too coihpiex. Fortunately, in most practical cases we 
only need an estimate of this volume, and heuristics like the 
one proposed in f 1] are sufficient. We have a preliminary 2-D 
implementation to estimate the integer area, which extends 
the work in [1] by handling affine loop bounds. The integer 
area is approximated by the "real” area plus half the points 
in the boundary. Using a Mathematica package 4 , we obtain 
all the vertices of the polyhedron from the loop inequalities. 
Then, using buiit-in functionality, the vertices are ordered 
clock-wise and the ‘real” area is obtained through triangu- 
lation. Finally, the perimeter is computed by applying the 
GCD rule to consecutive points. 

We can also apply this method in cases when symbolic 
terms only scale the polyhedron without changing its shape. 
In that case, we instantiate the symbolic terms for several 
values, compute the integer area with the previous method 
for each instantiation, and obtain an interpolating polyno- 
mial with the symbolic variable for these values using built- 
in functionality. 

In cases where symbolic terms change the shape of the 
polyhedron, we cam apply FME to obtain a bounding box 
as described in [4]. Assuming an n-dimensional space and 
using FME to project n — 1 dimensions,, we can obtain real 
bounds for the remaining dimension. Rotating the dimen- 
sions and repeating this process we can compute bounds for 
every dimension. 

We are currently working on extensions to handle more 
than two dimensions and symbolic terms more effectively. 

Graphic Visualization The built-in plotting functions in 
Mathematica are used to visualize iteration space and array 
access patterns in two or three dimensions. By merging plots 
for different processors (using different colors for each), the 
sets for a particular communication or the load distribution 
after partitioning a computation cam be shown. Masks in 
plotting functions can be used to represent non-convex re- 
gions or filter certain types of points. By combining these 
plotting functions with FME, any convex set defined by lin- 
ear inequalities cam be visualized. This tool proved to be 
invaluable for debugging the compiler. 

3 Thanka to Brian L. Evans at Georgia Tech for providing the Lat- 
ticeTheory package. 

‘‘Thanks to Komei Fukuda and Ichiro Mizukoshi of University of 
Tsukuba, Tokyo for providing the n-dimensional vertex enumeration 
package. 


Gi — h Code Generation We have extended Mathemat- 
ical capabilities to output structured C/C-f-f loops, condi- 
tionals, local declarations, and function headers and calls 
from Mathematica expressions. Together with the built-in 
function Splice and a template file, it is relatively easy to 
generate simple C-H- code. We are using this capability to 
generate inspector code for irregular problems and link it 
with PARADIGM’S irregular run-time support library, PI- 
LAR [18]. 

3 Compilation Techniques 

This section will explain how PARADIGM compiles a loop 
nest containing computations on regularly distributed ar- 
rays. The arrays are allowed to have tin arbitrary number 
of dimensions, each of which can have a block, cyclic, or 
block-cyclic distribution, or be replicated along a proces- 
sor dimension, or be sequentialized on a single processor. 
The array subscripts and joop bounds are allowed to be 
affine functions of index variables of enclosing loops. Be- 
cause of the symbolic nature of the approach, the number 
of processors need not be specified at compile time, allow- 
ing the resulting program to execute on an arbitrarily sized 
system. For simplicity, global references are not translated 
into local address spaces and, therefore, arrays are not scaled 
down as more processors are used. We are currently looking 
for an efficient way that can handle this local translation ef- 
fectively and uniformly for all supported array distributions. 

The data distribution and alignment of each array dimen- 
sion is either automatically generated by PARADIGM [14] 
or provided by the user through standard HPF directives. 
Given the data distributions, the compiler still must carry 
out two major tasks, namely the partitioning of the com- 
putation across processors and the generation of communi- 
cation code to transfer data among processors) How the 
compiler performs these tasks is the main focus of this sec- 
tion. 

3.1 Computation Partitioning ' 

The key to the symbolic analyses and transformations re- 
quired for both of the aforementioned tasks is the use of 
the Mathematica tools described previously. The symbolic 
FME package is the main engine used to generate various 
iteration sets for different data access patterns. 

To exploit parallelism in the source program, a com- 
piler must somehow distribute computations across proces- 
sors. PARADIGM partitions loops using the owner com- 
putes rule: a processor p only executes those iterations for 
which the left-hand side (Ihs) array reference of an assign- 
ment statement is local to (owned by) p. Since determining 
ownership for each access at run time is prohibitively costly, 
the compiler, through a process called loop bounds reduc- 
tion [16], derives new loop bounds confined to only p : s local 
iterations. This is to say that the new loop must only tra- 
verse the Access set [22] of the Ihs reference. Consider the 
loop: 

do i = L, U 

■ ■ ■ A(2i + 3) • • • 

end do 

where A(l : u) is block distributed on P processors indexed 
byp, 0 < p < p-l. By definition, Access (A(s(i)),p) is the 
set of all iterations i such that A{s(i)) is stored in p. This 
set is obtained from the polytope 1Z returned by FME when 
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called with the inequalities X constraining the loop variable 
i, which is also the polytope's only axis variable in this exam- 
ple (V = i). Two types of inequalities can be distinguished: 
loop inequalities and data inequalities. The former arises 
from. loop bounds and specifies constraints on the loop vari- 
ables and the relationships among them, while the latter is 
due to data distribution and describes the relationships be- 
tween processor coordinates and subscript functions (hence 
loop variables, unless the subscript is a constant). In this 
.example, 

1 = 


where 6 = 
returns: 

71 


The symbolic capabilities of Mathematica allow FME to 
treat b. p , P, L , U. 1. and u as literals, producing expressions 
that are parameterized by them, which have many advan- 
• tages: 

• The same code runs on every processor p, regardless 
of its location in the mesh. 

• The number of processors (P) in the mesh need not be 
known at compile time. 

• The values of the loop bounds, L and U, may be input 
at run time. 

\ 

• The array bounds, l and u, need not be compile-time 
constants. 

The same framework applies to block-cyclic distribu- 
tions. An array with a block-cyclic distribution with block 
size b (written cycli c(6)) can be viewed as a 2-D b x N ar- 
ray on each processor p, where N = ( [ ~ ] — l) is the 

number of blocks that p has. The blocks are enumerated 
with a block number n, 0 < n < N — 1, and the heads of 
any two consecutive blocks of the same processor tire at a 
distance bP apart. The nth block in p therefore contains- 
array elements from 6p + / + bnP to bp + l + bnP + 6 — 1, 
inclusive 0 . For the same loop as before, the input to FME 
becomes: 

L <i <U (loop ineq.) "1 

0 < n < j (data ineq.) I 

bp + l + bnP < 2i + 3 < 

[ bp + I + bnP +6 — 1 (data ineq.) 

V = n, i 

t 

and the output is: 

f [max (0, < n < [^±^J 

71 = < [max (L, =3±L±tet±i £. )] < j < 

[ [min (U, = * ±k ±i ^ s ± i z£. ) j 

5 Although one of the processors may have an incomplete last block, 
this causes no problems as the loop inequalities will provide a tighter 
upper bound. 




L <i < U (loop inequality) 

6p + l < 2i + 3 < 
bp + / + 6 — 1 (data inequality) 

[ “ ~p~ 1 ] is the block size of distribution. FME 


IT ( r — 3 + f + 6p A 

= { max [ L - J 


< i < 


min U 


.. —4 + 6 + / + 6p 


)J} 


= AccESs(.4(s(t)),p) 


Once 1Z is found, AccESS(A(s(i)),p) is simply the set of 
z’s in TZ. A purely cyclic distribution is just cyclic(i) 
so this case is covered by setting 6 = 1 in the above ex- 
pressions. These results easily extend to multi-dimensional 
arrays distributed on multi-dimensional processor meshes 
Some special cases are worth mentioning. If an array 
dimension A(...,l : u, ...) is replicated, then its data in- 
equality is l < s(i) < u and involves no processor indices. If 
an entire array dimension is sequentiaiized (or collapsed) on 
a particular processor r, then instead of a data inequality 
the mask "IF p = r THEN” is applied to the Access set. 

To perform loop bounds reduction, PARADIGM just 
needs to extract the loop inequalities from the the original 
loop bounds and the data inequalities corresponding to the 
Ihs reference and its distribution, and send them to FME' 
Then, it uses the polytope returned to construct the new 
loop. 

Using the previous results for the Access set of a block 
distributed array .-L. the sequential loop: 

do i = L, U ■ 

A(2i + 3) = • • ■ 
end do 

becomes: 

P = mvp() 

&=[^T 

do i = [max (L, [min (U. zi±^i±i£)J 

-4(2z + 3) = • • ■ 
end do 

after loop bounds reduction, where myp() returns the pro- 
cessor's index in the mesh. Similarly, for a cyclic( 6 ) distri- 
bution the new loop is: 

P = mvp() 

_ f ( n ,4— f4*2£. — od \ 1 ! —bv- i ~2C/ | 

do n = | max (0. ^ j | , [ Vp J 

do i = [max (L, ) ] , 

[min [U, c+± 6 ±iii 2 ± 6 n£)j 
-4(2z + 3) = • ■ • 
end do 
end do 

When there axe multiple Ihs array references, a conser- 
vative union U of their Access sets is used to form the 
reduced loop bounds, while each assignment statement is 
masked with the corresponding ACCESS set of its Ihs. For 
example, in the following loop: 

do i = L,U 

A(2i + 3) = • • • 

R(i-1) = ... 

end do 

let both A and B have a block-cyclic distribution, and 
let their Access sets be {L„ A <n< U nA ,Li A < i < U iA ) 
and {L ng <n< U ng , L tg < i < Ui g }, respectively. Their 
union is: 

U = { ) <n <max(Un A ,Un B ) 

[ mini Li A , Li g ) < i < max(Ui A ,Ui B ) / • 

and so the loop becomes: 
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do n = min (Z„ A , l nB ), ma x(Un A , U ng ) 
do i = min(Li A , Li g ), max{Ui A ,Ui g ) 

if ({Ln A <n< Un A ) and. ( Li A < i < Ui A )) then 
A(2i + 3) = ■ • - 
end if 

if (( L ng < n < U nB ) .and. ( Li g <i< Ui g )) then 
3(i -!) = ••- 
end if 
end do 

end do 

Since block and cyclic distributions are special cases 
of block-cyclic, this method works for all types of regular 
distributions. In particular, if some of the Ihs terms are 
block distributed, then before finding their ACCESS sets, 
they are first cast into a block-cyclic form by adding a 
bnP term to the bounds in their data inequality, and adding 
an extra data inequality, 0 < n < 0. However, if all of the 
Ihs terms are block distributed, then this transformation is 
not necessary. 

3.2 Communication Generation 

If a processor p does not own ail of the elements of a right- 
hand side ( rhs ) array reference R required by a statement 
that it executes, then this data must be sent from its owner, 
say q, to p via inter-processor communication. To reduce 
communication overheads, array elements can be combined 
into a single larger message instead of being communicated 
individually. This optimization is generally known as mes- 
sage vectonzation [16]. 

,The approach is the basically similar to the work of An- 
court [4) which describes regions requiring communication 
using a set of linear inequalities and generates loops to scan 
these regions through a Fourier- Motzkin projection. Depen- 
dence analysis done by Parafrase-2 is used to determine the 
communication point, which is the outermost loop at which 
the combining can be applied. 

Therefore, the main roie of the compiler is to obtain rel- 
evant linear inequalities from the loop bounds, data decom- 
positions, and array references in the source program, send 
these inequalities to FME. and from its results generate the 
scanning loops to pack/send and receive/ unpack, and insert 
this code at the communication point. 

The set of iterations for which a processor p must receive 
elements of a rhs reference R from its owner processor q is 
called the Comm set of R, denoted Comm (R,p,q) (In previ- 
ous work [22], this was referred to separately as the Send set 
of q and the Receive set of p). To obtain the Comm set, the 
inequalities X needed are the loop inequalities 5 , the data in- 
equalities (parameterized by p) of the Ihs reference, and the 
data inequalities (parameterized by q) of the rhs reference 
R. The resulting poiytope 7Z is used to both pack/send data 
(by setting q =myp()) and receive/unpack data (by setting 
P =myp()). 

This process is demonstrated using the HPF program in 
Figure 4. The processor grid is a Pi x P? mesh' (hence 
0 < pi < Pi - 1 and 0 < p 2 < P* - I). The Ihs array 
.4 is distributed by cyclic(5) on the first mesh dimension 

8 From loops nested below the communication point only; index 
variables of outer loops are treated as symbolic constants. 

‘The usage of the processors declaration in the program is not 
standard, but it simplifies the specification of a variable number of 
processors in both mesh dimensions. 


real .4(170) 
real 5(120, 120). 

IHPFS processors P( Pi, P 2 ) 

!HPFS template T(170, 2) 

IHPFS align A(k) with T(k, 1) 

IHPFS distribute T(cyclic(5), block) onto P 
IHPFS distribute 3(cyciic(3), cyclic(7)) onto P 
do i = 3, 40 

do j = 2. i — 1 

.4(4t 4- 5) = B(2i 4- j — 1, 3t — 2 j 4- 1) 
end do 
end do 

Figure 4: Example Loop 

and sequentialized (p 2 = 0) on the second mesh dimension: 
i.e., only the processors whose second coordinate p 2 is 0 will 
own parts of .4. The first dimension of the rhs array B is 
distributed by cyclic(3) on the first mesh dimension, while 
the second array dimension is cyclic(7) on the second mesh 
dimension. 

First, the compiler uses data dependence analysis to de- 
termine that communication can take place outside the en- 
tire loop nest. Then, it calls FME with inequalities derived 
from the loop (as explained below), and extracts the Comm 
set from the solution 1Z returned to construct the commu- 
nication code to pack/send and receive/unpack data. The 


input sent to FME is: 



' 3 < i < 40 

(i-loop) 

• 

2 < j < i - 1 

(/-loop) 


0 < n < 

{Ihs) 


5pi 4-14-5* 4n<4i4-5< 



op\ 4-14-5* 4n 4-5 — 1 

{Ihs) 

I = < 

0 < mt < 

{rhsdi) 


3qi 4-14-3* 4mi < 2i 4- j — 1 < 



3qi — 1 4- 3 * 4mi 4-3 — 1 

{rhsdi ) 


0 < m, < 

( rhs d 2 ) 


lq 2 4- 1 4-' 7 * 2ni2 < 3i — 2j 4- 1 < 



7<72 4- 1 4- 7 * 2m2 4-7 — 1 

(rhsdz) , 

V = T3 

, mi, m 2 , i, j 



The loop inequalities come directly from the loop bounds. 
A parr of data inequalities comes from each of the three array 
dimensions involved (one from Ihs and two from rhs)\ each 
pair consists of an inequality bounding the block number 
and one bounding the subscript function. The block num- 
bers for the first and second dimensions of the rhs are mi 
and m 2 , respectively. Since the rhs determines the sender 
{qi, qz), the processor coordinate^ involved in these inequal- 
ities are qi and qz- Similarly, the block number for the Ihs is 
n, and the processor coordinate involved is only pi because 
the ihs determines the receiver (pi,p 2 ) and 14 is distributed 
only along the first processor dimension. Although p 2 is 
not involved in the inequalities, the compiler takes into ac- 
count the fact that only processors with coordinate p 2 = 0 
own parts of the Ihs and hence are potential receivers, and 
generates code accordingly (shown in Figure 5). The loops 
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{PACK/SEND phase: .processor (qx,qi) 
sends to processor (pi,pi = 0)} 

91 = mypiO 

92 = myp 2 () 

Pi - 0 

do 20 pi = 0, Pi — 1 
if ( (pi ,ne. gi) .or. (pi .ne. qi) ) then 
len = 0 

do 10 n = ceiling( max(0. reai(4 — 5pi 4- 4<7 i)/(5Pi), 
real(16 - 35pi 4- 24qi 4- 2Sp2)/(35i 3 l ), 
real(48 — 35pi 4- 28q?)/(35Pi), 
real(16 — 15pi 4- 28q2)/(15Pi), 
real(12 - 5pi)/(5Pt))), 
floor(min(real(3356 — lopi — 1 12t?-2 ) /(15P- 
real(164 - 5pi)/(5Pi), reai(236 - 5pi — i^-2 ) / ( 5 Pi ) . 
real(1124 — 5pt — 24<7 i)/(5Pl). 
reai(284 — 5pi — 7qi)/(oPi ), 
read(844 - 5pi - 21<72}/(5P. ))) 
do 10 mi = ceiling(max(0. real(4 - 3<7 i)/(3Pi), 
real(44 — 15pi — 15nP ; - 24qi)/(24Pi), 
realf— 548 4- 35pi 4- 35nPi — 24<7 i)/(24Pi ), 
realf — 16 4- 5pi -4 5nPi — ' 18qi )/(18Pi), 
realf— 8 4 - 5pi — 5nPi - 6 oi)/(6Pi), 
real(2 - 9^1 + 149*)/(9Pi))), 
floor(min(real(39 — <7 i)/P, 
reaiifl 124 — opi — 5nPi — 24?i )/(24Pi), 
realf —4 — 5pi ~bnP\ - 4qi )/(4Pi), 
realf — 16 4- 35pi 4- 35nPi - 24?i - 28q 2 )/(24Pi), 
real(276 - 691 - )/(6Pi )) ) 

do 10 mi = ceilingfmaxfO, 
realf— 20 4- opt 4 - 5nPi - 28t?2 )/(28ip> ) , 
realf -84 4- 35pi ~'24m\Pi 4-35 nP { 

-24pi -28q 2 )/(2SP?), 
realf" — 6miPi -.601 — 7qi)/(7Pi), 
realf— 3 4 - m v Pi - <71 - 7qi)/(7Pi))), 
floor(min(real(116 — ~qi)/(7Pi), 
realf — 16 4- 15pi 4- 15nPi - 23qi) / (2SPi) , 
realf— 16 4- 35pi — 24miPi t- 35nPi 
—2$qi—2$qi)/(2SPi), 
realf— 2 4 - 9m\P t 4 - 9qi - 14q2)/(14P?), 
real(276 — 6mtPi - 691 - 7qi)/(7Pi))) 
do 10 i = ceiling(maxf3. real(4 4- Tm^Pj 4- "q 2 )/3, 
realf — 4 4- 5nPi 5pi )/4. 1 4- mi Pi 4- <71 , 
miPi 4- <72 4- real(4 4- 6miP ; 4- 691 )/7)), 
ftoor(min(40. realf 5nPi 4- 5pi)/4. 
real(2 4- 3miPi 4- 3qi)/2, 4 4- 7m? P 2 4- 7qi 
2 4- m 2 P? 4- <72 + reaifOmtPi 4- 6qi)/7)) 
do 10 j — ceiling(max(2, 2 - 2i 4- 3miPi + 3?i, 
realf— 6 4- 3i — 7miPi - 7qi)/2)), 
floorfminf — 1 4- i, 4 — 2i 4- 3miPi 4- 3qi, 
real(3i - 7miPi -7qi)/2)) 
bufferflen) = 8(2i + j — 1, 3t — 2 j + 1) 
len = len4-l 
10 continue 

send(msgid(qi,q2), buffer, len*4. (pi,p2) ) 
end if 
20 continue 


{RECEIVE/UNPACK phase: processor ( pi,pi = 0) 
receives from processor (91,92)} 
pt = mypif) 
pi = myp2() 
if (p2 eq. 0) then 
do 40 91 = 0, Pi — 1 
do 40 <7? = 0. P? — 1 
if ( (pi ne. 91) .or. fp? .ne. 92) ) then 
len = 0 

recv(msgid(9i,92), buffer, BUFFERSIZE) 
do 30 n = Jj n (Pi,P*,Pi,9i,92), 

Un(P\ , Pit Pi , 9l , 92) 
do 30 mi = (Pi, P : , pi, 9 t , 92, n), 

U m \ ( Pi, P2, pi, 91, 92 - n) 
do 30 m? = Lm-, (P\ , Pi, pi, 91 , 92, n, mi), 
Um 7 (Pi,Pi,p\,qi,qi.n,mi) 
do 30 i = Z-i (Pi , Pj.pi, 91 , qi',n. mi, mi), 

■ Ui(Pi, P>, pi, 91. 92, n, mi, m?) 
do 30 j = L](Pi,P:.pi,qi,q 2 ,n,mi,mi,i), 
Uj(P\ , Pi.puqi.qi, n, mi, mi, i) 

B(2i 4- j — 1, 31 — 2 7 4 1) = bufferflen) 
len = len4-l 
30 continue 
end if 

40 continue 
end if 


[EXECUTION phase) 
if (pi .eq. 0) then 

do 50 n = ceiiing(real(12 - 5pi)/(5Pi)), 
floor(real(164 -5pt)/(5Pi)) 
do 50 i = ceiling(max(3. 
realf— 4 — 5nP; 4- 5pi j/4)), 
3oor(min(40. real(5nPi — 5pi)/4)) 
do 50 j = 2. — 1 4 i 

.4(41 4- 5) = B(2i'-r ; - 1. 3 i - 2; 4- 1) 
50 continue 
end if 


Figure 5: SPMD Code for the Example Loop 



in the receive/unpack phase have the same bound expres- 
sions as those in the pack/send phase, since both come from 
the same polytope. Therefore, we have oniy included their 
abbreviated bound expressions along with the variables of 
which these expressions are functions. The figure also in- 
cludes the execution phase, which is the result of the loop 
bounds reduction procedure described previously. 

In this example, the communication code essentially tra- 
verses the entire processor space since the program has an 
all-to-many communication pattern. However, the commu- 
nication pattern is often noc as complicated. For example, 
many stencil computations exhibit one-to-one (shift) com- 
munication patterns when their arrays are properly aligned. 
In such cases, it is useful to find receiving and sending pro- 
cessor sets. Then, the send phase only has to scan through 
the receiving processor set instead of the entire processor 
space. Similarly, the receive phase only scans through the 
sending processor set for candidate senders. The sending 
processor set can be found simultaneously with r.he Comm 
set in a single invocation to FME. To do this. FME is called 
with the same input that we used to find the Comm set 
as before, but with two additions. A processor inequality 
0 < q < P — I is added to X, and the variable q is prepended 
to the list V, as q is now the first 'polytope axis. Note that 
this set is parameterized by the receiver p. The same steps 
can be followed to find the receiving processor sec: simply 
replace q by p and proceed as before. Similarly, this set is 
parameterized by che sender q. 

4 Results 

Two small scientific programs, LU and POTENG. were com- 
piled using the techniques described in the previous sections. 
The communication generation is not fully integrated with 
the rest of the compiler yet. so we manually added the com- 
munication code generated by Mathematica. LU is a stan- 
dard LU matrix factorization code, and POTENG computes 
the potential energy in che molecular dynamic simulation 
program (MDG) of the Perfect Benchmarks [20]. These pro- 
grams were chosen because cyclic or block-cyclic distribu- 
tions are required to obtain reasonable performance. Boch 
programs were run on an IBM SP-2 using MPIF'[12). 

LU has three 1024 x 1024 2-D arrays, two distributed 
in a column-cyclic manner (U and the input matrix) whiie 
the other one (L) in a row-cyclic manner. The computation 
consists mainly of two doubly- nested loops (one for L and 
the other for U), both enclosed in an outermost loop. In 
each iteration of the outermost loop the program computes 
a column of L and a row of U, where L and U are trian- 
gular matrices. Both operations are performed in parallel 
and with reasonable load balance due to the cyclic distri- 
butions. Communication occurs outside the doubly-nested 
loops but inside the outermost loop. The communication 
pattern involves a broadcast of a row and a column in each 
iteration of the outermost loop. The broadcast initiator, 
the size of the broadcast message, and the number of re- 
ceiving processes depend on the outermost iteration. This 
makes it difficult to identify the broadcast. The speedup 
for up to 32 processors is shown in Figure 6. We can see in 
this figure that it is critical to identify the communication 
pattern as a broadcast, and therefore be able to use a MPI 
collective communication primitive instead of point-to-point 
communication. We axe currently extending the high-level 
communication detection features of PARADIGM to handle 
such cases. 



Figure 6: Speedup of LU 


POTENG has six 1-D arrays, three of size .V. where ;V 
is the number of molecules to be simulated; and che other 
three of size 3 N. where 3 is the number of atoms in a water 
molecule. A cyclic distribution is selected for the first three 
arrays while a cyclic(3) distribution (i.e.. block-cyclic 
with block size of 3) for the second. 

This subroutine has two main loops. The first one has 
statements of the form: 

double precision x(3 N), xm(N), cl, c2 

do i = l, -V 

xm(i) = cl * x(3i — 1) + c2 * (x(3i — 2) 4- x(3t)) 

• • • {similar computations with ym and :m} ■ ■ ■ 

whiie the second one is a computationally intensive trian- 
gular loop. The cyclic(3) distribution becomes a natu- 
ral choice to satisfy both communication and load balanc- 
ing requirements (in agreement with the decision made by 
PARADIGM’S automatic data partitioner [14j). 

Prerequisite steps for the parallelization of the triangular 
loop are the following: 

• Inlining three small subroutine calls. 

• Fairly complex induction variable elimination. 

• Array privatization of five small arrays. 

• Detecting two simple reductions. 

In our case we used the induction variable elimination 
of Para£rase-2 and manually performed the other steps. All 
of these steps are automatically performed by the Polaris 
compiler[8|. 

Communication operations generated by PARADIGM 
were combined outside the triangular loop, and reductions 
were performed using MPI collective communication primi- 
tives. Figure 7 plots the speedup for N = 3000 for up to 32 
processors. The encouraging speedup shown in the figure is 
a good indication that the distributions chosen are correct 
and that our compilation techniques are effective. 
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Processors 


Figure 7: Speedup of POTENG 


5 Conclusions and Future Work 

The techniques presented in this paper allows PARADIGM 
to deal with ail of che common regular data decompositions 
(block, cyclic, and block-cyclic) within a unified frame- 
work. By leveraging off the symbolic power of existing pack- 
ages like Mathematics . . PARADIGM has immediate access 
to symbolic capabilities that ultimately allows a much wider 
range of compilable programs than was possible before. 

The work presented does not translate global references 
to the local address space of a processor; we are working on 
better wavs to handle local translations uniformly for any 
of the supported data distributions. We are also extending 
previous work [14l on automatic detection of high-level com- 
munication to handle more complicated cases such as that 
encountered in LU. 
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